In scientific research, we cannot prove that our theoretical model is true. However, we can investigate the probability that our theoretical model is not true. To measure and quantify this uncertainty, we employ probability theory and statistics.

A Brief Introduction to Probability Theory

In social science, we are interested in studying the variation in one outcome, or sometimes a set of outcomes, across time and/or location: interstate war, civil war, housing prices, gender inequality, election results, and so forth. We usually denote this outcome variable, also known as the dependent variable, by \(Y\). Roughly speaking, a theory/hypothesis/proposition explains the association between the outcome variable \(Y\) and another random variable, called the explanatory variable (also known as the independent variable, exogenous variable, or feature) and usually denoted by \(X\).

Virtually all statistical methods employed by social scientists aim to find and explain the probability of observing a specific outcome \(Y=y\) given a specific value of the independent variable \(X=x\). We can write this as \(P(Y=y|X=x)\); some simply write \(P(y|x)\). As you will see in this course, and may have seen in your other advanced methods courses, the statistical techniques and algorithms that we use can be divided into two main groups based on whether \(Y\) is continuous or discrete/categorical. The latter is often called a classification problem. (Note: do not confuse classification with clustering. These are two different types of problems. We will discuss clustering in the final sessions of this course.)

The common practice is to study \(P(Y=y|X=x)\) directly by making an assumption about the functional form of the association between \(Y\) and \(X\). For instance, in linear regression models such as OLS, we assume that the association between \(Y\) and \(X\) is linear: \(y=\alpha+\beta x+\epsilon\). That said, you do not have to use a probability model to develop a statistical learning model. You can also develop an algorithm and explore its performance. In fact, you can develop any algorithm you want, but you need to show that it works. \(k\) nearest neighbors (kNN) is one such algorithm that does not need any probability model; we will cover kNN in this course. There is also another approach to studying \(P(Y=y|X=x)\), which uses Bayes’ Theorem and is the topic of this session.

We will first briefly review basic probability notation and concepts, then learn Bayes’ Theorem, and finally study the Naive Bayes algorithm.

Discrete Random Variables

We are uncertain about the value of a random variable. Thus, we use a (statistical) distribution to describe it. A discrete random variable can take only a countable number of values, and its probability distribution gives the probability of realizing each value/outcome. For example, if we toss a fair coin twice, then the distribution of the number of heads, \(X\), can be written as follows:

\(X\) \(P(X=x)\)
\(0\) \(.25\)
\(1\) \(.5\)
\(2\) \(.25\)
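
The table above can be reproduced by listing every equally likely sequence of tosses and counting heads. The following is a minimal Python sketch; the function name `heads_pmf` is an illustrative choice, not something defined in these notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_pmf(n_tosses=2):
    """PMF of the number of heads in n_tosses tosses of a fair coin."""
    # Every sequence of H/T is equally likely for a fair coin.
    sequences = list(product("HT", repeat=n_tosses))
    counts = Counter(seq.count("H") for seq in sequences)
    return {k: Fraction(v, len(sequences)) for k, v in sorted(counts.items())}


pmf = heads_pmf(2)
for x, p in pmf.items():
    print(x, float(p))  # prints 0 0.25, then 1 0.5, then 2 0.25
```

Exact fractions are used so that the probabilities sum to exactly one.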

Question: What is \(P(X \geq 1)\)? What is \(P(X > 1)\)?

Probability mass/density function and cumulative distribution function

\(f(x)=P(X=x)\) is known as the probability mass function (PMF) when \(X\) is discrete. When \(X\) is continuous, \(f(x)\) is the probability density function (PDF), which gives a density rather than a probability. The cumulative distribution function (CDF), \(F(x)=P(X\leq x)\), is

\[F(x)=\sum_{k\leq x}f(k)~~\text{if } X \text{ is discrete}\]

and

\[F(x)=\int_{-\infty}^{x}f(k)\,dk~~\text{if } X \text{ is continuous}.\]
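
For a discrete variable, the CDF is just a running sum of the PMF. As a small illustration, the sketch below applies this to the two-toss coin distribution from the previous section; the variable names are illustrative.

```python
from itertools import accumulate

# PMF of the number of heads in two tosses of a fair coin (from the table above).
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

# F(x) = sum of f(k) over all k <= x, computed as a running total.
values = sorted(pmf)
cdf = dict(zip(values, accumulate(pmf[v] for v in values)))

print(cdf)  # {0: 0.25, 1: 0.75, 2: 1.0}
```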

Some of the commonly used distributions are \(X\sim N(\mu,\sigma)\), \(X\sim Bernoulli(p)\), and \(X\sim Poisson(\lambda)\).

Conditional, joint, and marginal distributions

  • Joint (Probability) Distribution: The probability of observing two (or more) outcomes together is given by the joint probability distribution, written as \(P(X=x, Y=y)\). Of course, it can be extended to more than two outcomes:

\[P(X_1=x_1,X_2=x_2,\dots,X_N=x_N).\]

  • Conditional (Probability) Distribution: Intuitively, this describes how knowing the values/outcomes of some variables affects our understanding of the values/outcomes of other variables. The simplest version, for the two-variable case, is \(P(Y=y|X=x)\), and the extended version is:

\[P(Y_1=y_1,Y_2=y_2,\dots,Y_L=y_L|X_1=x_1,X_2=x_2,\dots,X_M=x_M).\]

Suppose that we are interested in modeling the type of contentious politics, \(S \in \{0=Peace, 1=Nonviolent~Protests, 2=Violent~Protests, 3=Civil~War\}\), in The Republic of Gilead [2] next year given the economic condition \(C\). For the sake of simplicity, assume that there are two economic conditions: boom and bust.

\(S\) \(P(S=s|C=boom)\) \(P(S=s|C=bust)\)
Peace \(.80\) \(.50\)
Nonviolent Protests \(.15\) \(.25\)
Violent Protests \(.03\) \(.20\)
Civil War \(.02\) \(.05\)

We read \(P(S=Violent~Protests|C=bust)=.20\) as follows: the probability of violent protests (\(S=Violent~Protests\)) conditional on a bad economy (\(C=bust\)) is \(.20\).

[Figure: Outcome tree]

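There are eight joint outcomes. The probability of each is the conditional probability above multiplied by the probability of the economic condition, \(P(S=s, C=c)=P(S=s|C=c)P(C=c)\); the values below correspond to \(P(C=boom)=.2\) and \(P(C=bust)=.8\):

\((C, S)\) \(P(S=s, C=c)\)
\((Boom, Peace)\) \(.16\)
\((Boom, Nonviolent)\) \(.03\)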
\((Boom, Violent)\) \(.006\)
\((Boom, Civil~War)\) \(.004\)
\((Bust, Peace)\) \(.4\)
\((Bust, Nonviolent)\) \(.2\)
\((Bust, Violent)\) \(.16\)
\((Bust, Civil~War)\) \(.04\)
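
The joint probabilities can be recomputed from the conditional table with a few lines of Python. This is a sketch under one stated assumption: the marginals \(P(C=boom)=.2\) and \(P(C=bust)=.8\) are not given explicitly in the text; they are inferred by dividing joint entries by the matching conditional entries (for example, \(.16/.80=.2\) and \(.4/.5=.8\)).

```python
# Conditional distributions P(S = s | C = c), copied from the table in the text.
p_s_given_c = {
    "boom": {"Peace": 0.80, "Nonviolent": 0.15, "Violent": 0.03, "Civil War": 0.02},
    "bust": {"Peace": 0.50, "Nonviolent": 0.25, "Violent": 0.20, "Civil War": 0.05},
}

# Assumption: marginal P(C = c), inferred from the tables (e.g. .16 / .80 = .2).
p_c = {"boom": 0.2, "bust": 0.8}

# Joint distribution: P(S = s, C = c) = P(S = s | C = c) * P(C = c).
joint = {
    (c, s): p_c[c] * p
    for c, dist in p_s_given_c.items()
    for s, p in dist.items()
}

# Marginal distribution of S: sum the joint over the economic conditions.
marginal_s = {}
for (c, s), p in joint.items():
    marginal_s[s] = marginal_s.get(s, 0.0) + p

for key, p in joint.items():
    print(key, round(p, 3))
print("total:", round(sum(joint.values()), 3))
```

A joint distribution must sum to one over all eight outcomes, which is a quick sanity check on any table like this one.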

Bayes’ Theorem

We are interested in finding \(P(y|x)\). The common practice among frequentist statisticians is to assume a functional form for the association between \(x\) and \(y\), and then to estimate the parameters of that function using methods such as OLS, MLE, and so forth. However, there is another approach, based on Bayes’ Theorem:

\[\begin{align} P(y|x)=\frac{P(x,y)}{P(x)}=\frac{P(x,y)}{\sum_{y'} P(x,y')}=\frac{P(y)P(x|y)}{\sum_{y'} P(y')P(x|y')} \end{align}\]

In other words, Bayes’ Theorem lets us compute \(P(y|x)\) from \(P(y)\) and \(P(x|y)\).

Picnic Day Example: Assume that you are planning to go out for a picnic today. However, you find that the weather is cloudy when you wake up. 50% of rainy days start off cloudy. However, cloudy days are common in your city: 40% of days start cloudy. Further, since it is a dry month, you know that on average 3 of 30 days end in rain, i.e., 10%. The question is: what is the likelihood of rain today given that it starts cloudy?

Solution:

\[\begin{align} P(Rain|Cloud)=\frac{P(Cloud|Rain) P(Rain)}{P(Cloud)} = \frac{.5 \times .1}{.4}=.125=12.5\% \end{align}\]
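
The same calculation can be written as a small Python function. This is only a sketch; the function and variable names are illustrative, and the three inputs are the numbers stated in the example.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    if not 0 < evidence <= 1:
        raise ValueError("evidence must be a probability in (0, 1]")
    return likelihood * prior / evidence


p_rain = 0.10              # P(Rain): 3 of 30 days end in rain
p_cloud_given_rain = 0.50  # P(Cloud | Rain): 50% of rainy days start cloudy
p_cloud = 0.40             # P(Cloud): 40% of days start cloudy

p_rain_given_cloud = posterior(p_rain, p_cloud_given_rain, p_cloud)
print(f"{p_rain_given_cloud:.3f}")  # prints 0.125
```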

Question: What is the posterior odds ratio if \(y \in \{0,1\}\)?

Extending Bayes’ Theorem to more than two variables

Often, we face a question that requires modeling more than two variables. We can extend the relationship between the joint and conditional distributions to more than two variables:

\[\begin{align} P(Y_1=y_1, Y_2=y_2, Y_3=y_3)= P(Y_1=y_1) P(Y_2=y_2|Y_1=y_1) P(Y_3=y_3|Y_1=y_1, Y_2=y_2) \end{align}\]
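
The three-variable factorization above is a special case of the chain rule of probability. For \(n\) variables (the symbol \(n\) is introduced here for illustration), it can be written as:

\[\begin{align} P(Y_1=y_1,\dots,Y_n=y_n)=\prod_{i=1}^{n} P(Y_i=y_i|Y_1=y_1,\dots,Y_{i-1}=y_{i-1}) \end{align}\]

where the factor for \(i=1\) is the unconditional probability \(P(Y_1=y_1)\).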

Independence

Consider a set of random variables, \(\{X_1, X_2, \dots, X_k\}\). They are independent of each other if the conditional distribution of any of them does not depend on our observation of any of the others.

Example: Suppose a fair coin is tossed 20 times. Let \(Y_i\) be \(1\) if the coin lands heads on the \(i^{th}\) toss, and \(0\) otherwise. What is \(P(Y_{11}=1)\)? What is \(P(Y_{11}=1|Y_1=1,\dots,Y_{10}=1)\)?

If two variables are independent, then the conditional distribution reduces to the marginal distribution:

\[\begin{align} P(Y=y|X=x)=P(Y=y) \end{align}\]

So,

\[\begin{align} P(X=x,Y=y)&=P(Y=y|X=x) P(X=x)\\ &=P(Y=y)P(X=x) \end{align}\]

Theorem: \(X\) and \(Y\) are independent if and only if \(P(x,y)=P(x)P(y)\) for all \(x\) and \(y\).
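
The product condition in the theorem can be checked numerically for any small joint table. The sketch below uses two fair coin tosses: the two tosses are independent of each other, while the first toss and the total number of heads are not. The function name `is_independent` is illustrative.

```python
from fractions import Fraction
from itertools import product


def is_independent(joint):
    """Check whether P(x, y) == P(x) * P(y) for every pair (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p  # marginal of the first variable
        p_y[y] = p_y.get(y, 0) + p  # marginal of the second variable
    return all(
        joint.get((x, y), 0) == p_x[x] * p_y[y] for x in p_x for y in p_y
    )


quarter = Fraction(1, 4)
tosses = list(product([0, 1], repeat=2))  # (first, second), where 1 = heads

# Joint distribution of the two tosses themselves.
joint_tosses = {(a, b): quarter for a, b in tosses}

# Joint distribution of the first toss and the total number of heads.
joint_first_total = {}
for a, b in tosses:
    key = (a, a + b)
    joint_first_total[key] = joint_first_total.get(key, 0) + quarter

print(is_independent(joint_tosses))       # True
print(is_independent(joint_first_total))  # False
```

Exact fractions avoid floating-point rounding in the equality check; with estimated probabilities one would compare within a tolerance instead.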

Identical distributions

Another common assumption in statistical analysis is that random variables are identically distributed. This means that if we toss a coin 100 times, and \(Y_i=1\) if the \(i^{th}\) toss lands heads and \(0\) otherwise, then \(P(Y_{30}=1)=P(Y_{85}=1)\)!

When random variables are independent and identically distributed, they are called IID.

Naive Bayes Classification

Despite its other name, “Idiot’s” Bayes Classification, this method often shows a very good classification performance!

Assume that you have a categorical outcome \(Y\). As mentioned above, we are interested in finding \(P(y|x)\), that is, the probability of \(Y=y\) conditional on observing \(X=x\).

For the sake of simplicity, assume that \(Y \in \{0,1\}\). Then, the common practice among political scientists is to estimate a logistic regression:

\[ Y|x \sim Bernoulli(p(x)), \quad p(x)=P(Y=1|x)=\frac{e^{x'\beta}}{1+e^{x'\beta}} \]

where \(x=(x_1, x_2, \dots, x_p)'\) and \(\beta=(\beta_1, \beta_2, \dots, \beta_p)'\).
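
To make the logistic form concrete, here is a minimal Python sketch that evaluates \(p(x)\) for given \(x\) and \(\beta\). The function name and the numbers are hypothetical, chosen only for illustration; this evaluates the formula and does not estimate \(\beta\).

```python
import math


def logistic_probability(x, beta):
    """p(x) = exp(x'beta) / (1 + exp(x'beta)), written as 1 / (1 + exp(-x'beta))."""
    index = sum(x_j * b_j for x_j, b_j in zip(x, beta))  # linear index x'beta
    # Simple form for moderate values; very negative indexes would overflow exp.
    return 1.0 / (1.0 + math.exp(-index))


# Hypothetical values: when all coefficients are zero, p(x) is 0.5 for any x.
print(logistic_probability([1.0, 2.0], [0.0, 0.0]))  # 0.5
```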

However, we can use Bayes’ Theorem to compute \(P(y|x)\) indirectly. Naive Bayes classification adds a simplifying assumption, which is what makes the method “naive”: the features/explanatory variables \(X=(X_1, X_2, \dots, X_p)\) are conditionally independent given \(Y\). Start from Bayes’ Theorem for class \(k\):

\[ p(Y=k|X=x)=\frac{p(X=x|Y=k)p(Y=k)}{p(X=x)} \]

For continuous \(x\), writing \(\pi_k=p(Y=k)\) for the prior probability of class \(k\), \(f_k(x)\) for the density of \(X\) within class \(k\), and \(K\) for the number of classes, we can rewrite this as:

\[ p(Y=k|X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)} \]

Returning to the discrete model, for a problem with \(X=(X_1, X_2, \dots, X_p)\), we have:

\[ p(Y=k|X=x)=\frac{p(x_1, x_2, \dots, x_p|Y=k)p(Y=k)}{p(X=x)} \]

Naive Bayes assumes that \(p(x_1, x_2, \dots, x_p|Y=k)=\prod_{i=1}^{p} p(x_i|Y=k)\). Thus, because the denominator \(p(X=x)\) does not depend on \(k\),

\[ p(Y=k|X=x) \propto p(Y=k)\prod_{i=1}^{p} p(x_i|Y=k) \]

What is the interpretation of the above formula?
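
One way to explore the formula is to code it directly for categorical features. The sketch below estimates the prior and each per-feature conditional probability by relative frequencies, multiplies them, and normalizes across classes. Everything in it is an assumption made for illustration: the function names, the two features (economy, election year), and the eight toy observations are hypothetical, and no smoothing is applied, so a feature value never seen within a class gives that class probability zero.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Estimate p(y) and the counts behind p(x_i | y) from categorical data."""
    class_counts = Counter(labels)
    priors = {y: c / len(labels) for y, c in class_counts.items()}
    # counts[y][i][v] = number of rows in class y whose i-th feature equals v
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            counts[y][i][value] += 1
    return priors, counts, class_counts


def predict_proba(x, priors, counts, class_counts):
    """p(y | x) proportional to p(y) * prod_i p(x_i | y), normalized over y."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= counts[y][i][value] / class_counts[y]  # estimate of p(x_i | y)
        scores[y] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("every class has zero score; consider smoothing")
    return {y: s / total for y, s in scores.items()}


# Hypothetical toy data: (economy, election year) -> protest (1) or no protest (0).
rows = [
    ("bust", "yes"), ("bust", "no"), ("boom", "yes"), ("bust", "yes"),
    ("boom", "no"), ("boom", "no"), ("bust", "no"), ("boom", "yes"),
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

priors, counts, class_counts = fit_naive_bayes(rows, labels)
proba = predict_proba(("bust", "yes"), priors, counts, class_counts)
print({y: round(p, 3) for y, p in proba.items()})
```

For this toy data the unnormalized scores are \(.5 \times .75 \times .75\) for protest and \(.5 \times .25 \times .25\) for no protest, so the normalized probability of protest is about \(.9\).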

Next week: Spam or Ham


  1. Copyright 2019. This is an in-progress project, please do not cite or reproduce without my permission.

  2. The Republic of Gilead is the fictional country in The Handmaid’s Tale (Margaret Atwood, 1985).